### Practical 1: Pre-processing of Text Document

Aim: To perform text pre-processing by removing stopwords and tokenizing a sentence.

Description: Stopwords such as "is", "the", and "in" are common words that carry little meaning. Removing them helps focus on the important words. Tokenization splits text into smaller units called tokens.

Code – Stopword Removal:

```python
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load the English stopword list
stop_words = set(stopwords.words('english'))

# Display the stopwords
print(stop_words)
```

Code – Tokenization and Filtering:

```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

# Split the sentence into word tokens
word_tokens = word_tokenize(example_sent)

# Keep only tokens that are not stopwords (case-insensitive check)
filtered_sentence = []
for w in word_tokens:
    if w.lower() not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
```

### Practical 2: Boolean Retrieval Model

Aim: To implement the Boolean Retrieval Model for given documents.

Description: Documents are represented as sets of words. Boolean operators used:
- AND ( ∧ ) → both terms must be present
- OR ( ∨ ) → at least one term must be present
- NOT ( ¬ ) → excludes documents containing the term

Code:

```python
import re

docs = {
    1: "Information Retrieval has 2 models and information.",
    2: "Boolean is a basic Information Retrieval classic model.",
    3: "Information is a data that processed, Information.",
    4: "When a Data Processed the result is Information, Data."
}

# All document IDs
all_docs = set(docs.keys())

# Check whether a term occurs in a document
def has_term(doc, term):
    words = set(re.findall(r'\w+', doc.lower()))
    return term in words

# Collect the documents containing each term
data_docs = set(i for i, doc in docs.items() if has_term(doc, "data"))
info_docs = set(i for i, doc in docs.items() if has_term(doc, "information"))
retrieval_docs = set(i for i, doc in docs.items() if has_term(doc, "retrieval"))

# Evaluate the query: (Data AND Information) OR (NOT Retrieval)
result = (data_docs & info_docs) | (all_docs - retrieval_docs)

print("Result for Query: (Data ^ Information) v (~ Retrieval)\n")
for i in sorted(result):
    print(f"Doc{i}:", docs[i])
```

### Practical 3: Vector Space Model

Aim: To implement the Vector Space Model using cosine similarity.

Description: Documents and queries are represented as vectors. Cosine similarity measures how similar two vectors are: 1 → very similar, 0 → not similar.
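The similarity computed in the code below is the standard cosine similarity; for a document vector d and a query vector q:

$$
\cos(d, q) = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert} = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}}
$$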
Code:

```python
import math, collections

docs = ["A man and a woman.", "A baby."]

# Build the vocabulary from all documents (lowercased, punctuation stripped)
vocab = sorted(set(w.lower().strip('.,') for d in docs for w in d.split()))

# Turn a piece of text into a term-frequency vector over the vocabulary
def vectorize(text):
    c = collections.Counter(w.lower().strip('.,') for w in text.split())
    return [c[t] for t in vocab]

# Cosine similarity between two vectors
def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b) if mag_a and mag_b else 0

query = "woman"
q_vec = vectorize(query)

for doc_num, doc in enumerate(docs, start=1):
    sim = cosine_sim(vectorize(doc), q_vec)
    print("Doc", doc_num, "similarity:", round(sim, 3))
```

### Practical 4: Web Spamming Detection

Aim: To detect keyword stuffing and web spam.

Description: Web spamming uses repeated keywords or hidden content to manipulate search rankings. Counting word frequencies helps detect potential spam.

Code:

```python
from collections import Counter
import re

web_content = """Cheap watches available now! Best cheap watches for you. Buy cheap watches online.
Cheap cheap cheap watches watches!"""

# Extract words (lowercase, ignore punctuation)
words = re.findall(r'\b\w+\b', web_content.lower())

# Count word frequencies
keyword_counts = Counter(words)

# Threshold for spam detection
SPAM_THRESHOLD = 4

# Print frequencies
print("Keyword Frequencies:")
for word, count in keyword_counts.items():
    print(f"{word}: {count}")

# Detect potential spam keywords
print("\nPotential Spam Keywords:")
for word, count in keyword_counts.items():
    if count >= SPAM_THRESHOLD:
        print(f"'{word}' appears {count} times (possible keyword stuffing)")
```

### Practical 5: Text Summarization

Aim: To implement extractive and abstractive text summarization.

Description: Summarization reduces a long text into a shorter version.
- Extractive → selects important sentences from the original text
- Abstractive → generates new sentences

Code:

```python
# Install the necessary packages (for Jupyter/Colab)
!pip install sumy transformers torch

import nltk
nltk.download('punkt_tab')

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from transformers import pipeline

# Sample text
text = """
Artificial intelligence (AI) is the technology that allows machines to simulate human intelligence,
enabling them to learn, reason, problem-solve, and make decisions. AI systems achieve this by
analyzing vast amounts of data to identify patterns, understand language, and recognize objects,
similar to how humans think and behave. Key applications of AI include natural language processing,
computer vision, and autonomous systems, impacting various industries by automating tasks and
improving decision-making. The four common types of Artificial Intelligence (AI), based on their
functionality, are: Reactive Machines, Limited Memory, Theory of Mind, and Self-aware AI. Reactive
machines, like the IBM Deep Blue chess program, act on the present situation but don't store
memories. Limited Memory AI, such as self-driving cars, can use past data to inform current
decisions. Theory of Mind and Self-aware AI are currently conceptual stages of AI that would
possess human-like understanding of emotions and consciousness, respectively.
"""

# Extractive summarization: pick the most central sentences with LexRank
def extractive_summary(text, num_sentences=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary = summarizer(parser.document, num_sentences)
    return ' '.join(str(sentence) for sentence in summary)

# Abstractive summarization: generate a new summary with a transformer model
def abstractive_summary(text):
    summarizer = pipeline("summarization")
    summary = summarizer(text, max_length=100, min_length=25, do_sample=False)
    return summary[0]['summary_text']

# Run both summarizations
print("===== Abstractive Summary =====")
print(abstractive_summary(text))

print("\n===== Extractive Summary =====")
print(extractive_summary(text))
```

### Practical 6: Inverted Index

Aim: To create an inverted index for document search.

Description: An inverted index maps each word to the documents in which it occurs. It allows fast lookup for search engines.

Code:

```python
import re
from collections import defaultdict

# Build an inverted index: term -> set of document IDs
def create_inverted_index(documents):
    inverted_index = defaultdict(set)
    for doc_id, document in enumerate(documents):
        # Normalize: lowercase and strip punctuation so "Document" and "document." map to the same term
        for word in re.findall(r'\w+', document.lower()):
            inverted_index[word].add(doc_id)
    return inverted_index

# Sample documents
documents = [
    "This is the first document.",
    "Second document is here.",
    "And this is the third document."
]

# Create the inverted index
inverted_index = create_inverted_index(documents)

# Display the inverted index
print(dict(inverted_index))
```
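Querying the index is then a matter of looking up each query term's posting set and intersecting the sets. The sketch below continues from the code above; the helper name `search_and` is only illustrative and not part of the practical:

```python
# AND query: return the IDs of documents containing every query term
def search_and(inverted_index, query):
    terms = re.findall(r'\w+', query.lower())
    if not terms:
        return set()
    # Start from the postings of the first term and intersect with the rest
    result = set(inverted_index.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted_index.get(term, set())
    return result

print(search_and(inverted_index, "third document"))  # {2}
```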
### Practical 7: Opinion Spam Detection

Aim: To identify spam reviews in user feedback.

Description: Spam reviews use promotional phrases like "Buy now", "Limited offer", or "Click here". Checking reviews for these keywords helps flag spam.

Code:

```python
# List of sample reviews
reviews = [
    "This phone is amazing, battery lasts all day!",
    "Worst phone ever, waste of money!",
    "Buy this product now!!! Limited offer!!!",
    "Great value for money, highly recommended!"
]

# List of spam keywords
spam_words = ["buy now", "limited offer", "click here", "free"]

# Flag a review as spam if it contains any spam keyword
for r in reviews:
    if any(word in r.lower() for word in spam_words):
        print("SPAM:", r)
    else:
        print("GENUINE:", r)
```
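Note that the substring check above can also match a keyword inside a longer word (for example, "free" inside "carefree"). A minimal variant using word-boundary regular expressions avoids this; it reuses `reviews` and `spam_words` from the code above, and the function name `is_spam` is only illustrative:

```python
import re

# Flag a review only when a spam phrase appears as whole words
def is_spam(review, spam_words):
    text = review.lower()
    return any(re.search(r'\b' + re.escape(phrase) + r'\b', text) for phrase in spam_words)

for r in reviews:
    print("SPAM:" if is_spam(r, spam_words) else "GENUINE:", r)
```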